1. Introduction

The electromagnetic (EM) form-factors (FF) of the nucleon are the quantities which
embody the information about the complex electromagnetic structure of the proton and
neutron [1]. In practice, the form-factors are introduced in order to model (on an effective
level) the electromagnetic hadronic current for elastic ep (n) scattering. In the one-photon
exchange approximation it has the following form:

\[ J^{\mu}_{EM} = \bar{u}(p') \left[ \gamma^{\mu} F_1^{p(n)}(Q^2) + \frac{i \sigma^{\mu\nu} q_{\nu}}{2 M_{p(n)}} F_2^{p(n)}(Q^2) \right] u(p), \tag{1.1} \]



where $q = p' - p$ denotes the four-momentum transfer; $M_{p(n)}$ is the proton (neutron) mass;
$p'$ and $p$ are the outgoing and incoming nucleon momenta; $Q^2 = -q^2$; $F_1^{p(n)}$ is the
helicity non-flip Dirac proton (neutron) form-factor, while $F_2^{p(n)}$ denotes the helicity-flip
Pauli proton (neutron) form-factor. The form-factors are normalized as follows:

\[ F_1^{p}(0) = 1, \quad F_2^{p}(0) = \mu_p - 1, \quad F_1^{n}(0) = 0, \quad F_2^{n}(0) = \mu_n, \tag{1.2} \]

where $\mu_p$ ($\mu_n$) is the magnetic moment of the proton (neutron) in units of the nuclear
magneton, so that $\mu_p - 1$ and $\mu_n$ are the anomalous magnetic moments.
The nucleon is a many-body system of strongly interacting quarks (three valence
quarks and any number of quark-antiquark pairs) and gluons. This complex system is
described by quantum chromodynamics (QCD) in the confinement regime. The study of the
EM form-factors gives an opportunity for testing the models describing the strong
interactions. However, computing the EM form-factors from first principles is an extremely
difficult task. Nevertheless, some progress has been made with effective approaches and
lattice QCD.

A good approximation of the FF is obtained within the vector meson dominance
(VMD) models [2, ?]. There are interesting results obtained with constituent quark models
[?] as well as with other approaches (for a review see [?]). However, a given theoretical
description usually works well only in a limited $Q^2$ range. In order to describe the full
$Q^2$ domain, various approaches must be combined. Hence a proper prediction of the FF in
a wide $Q^2$ range requires complex phenomenological models which contain plenty of
internal parameters.

On the other hand, the experimental data, which have been collected during the last
sixty years, cover a wide domain and are accurate enough to provide reasonable
information about the nucleon electromagnetic structure. Therefore one can try to represent
the nucleon form-factors by the data itself without assuming any model constraints. In
this article we follow this philosophy.

The description of the electromagnetic properties of the nucleon is a problem of great
interest in modern particle physics. The knowledge of the nucleon form-factors is also
important for practical applications. We mention two of them: (i) predicting the cross
sections for quasi-elastic charged current (CC) and elastic neutral current (NC) neutrino
scattering off nucleons and nuclei [?]; (ii) investigation of the strange content of the nucleon
in elastic lepton scattering off nucleons/nuclei [?, ?].

An accurate modeling of the neutrino-nucleus cross sections plays a crucial role in the
analysis of the $\nu_\mu$ neutrino oscillation data collected in the long-baseline experiments.
For instance, in experiments like K2K or T2K the neutrino energy spectrum
is reconstructed from the quasi-elastic-like events. Observing the distortion of the energy
spectrum in the far detector gives an indication of neutrino oscillations.

The investigation of the quasi-elastic CC neutrino-nucleon interactions gives an
opportunity to explore the axial structure of the nucleon. The weak hadronic current is
formulated assuming the conserved vector current (CVC) hypothesis. Then the vector part
of the current is expressed in terms of the electromagnetic FF of the proton and
neutron, while the axial contribution is described with two axial form-factors: $G_A$ and $G_P$
(pseudoscalar axial form-factor). The hadronic weak current for the CC $\nu n$ quasi-elastic



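scattering reads [12]

\[ J^{\mu}_{CC} = \bar{u}(p') \left[ \gamma^{\mu} F_1^{V}(Q^2) + \frac{i \sigma^{\mu\nu} q_{\nu}}{2M} F_2^{V}(Q^2) + \gamma^{\mu} \gamma_5 G_A(Q^2) + \frac{q^{\mu}}{2M} \gamma_5 G_P(Q^2) \right] u(p), \tag{1.3} \]

where $M = (M_p + M_n)/2$. The isovector Dirac and Pauli form-factors are defined as follows:

\[ F_{1,2}^{V}(Q^2) = F_{1,2}^{p}(Q^2) - F_{1,2}^{n}(Q^2). \tag{1.4} \]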
If the partially conserved axial current (PCAC) hypothesis is assumed, then the axial
form-factors can be related: $G_P(Q^2) = 4M^2 G_A(Q^2)/(m_\pi^2 + Q^2)$. The $G_A$ is usually
parameterized with the dipole functional form:

\[ G_A(Q^2) = g_A \left( 1 + \frac{Q^2}{M_A^2} \right)^{-2}, \qquad g_A = -1.2695 \pm 0.0029. \tag{1.5} \]

$M_A$ denotes the axial mass. Notice that recent studies [13, 14] suggest an $M_A$ value larger
by about 20% with respect to the old measurements [15, 16, 17]. The impact of the
electromagnetic form-factors on the axial mass extraction is small, but it can play a role
in the future, when more precise measurements of the neutrino-nucleon cross-sections are
performed.
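The dipole form (1.5) and the PCAC relation can be sketched numerically as below. This is only an illustration: the values of the axial mass, the averaged nucleon mass and the pion mass are assumed placeholders and are not taken from the analysis, and the function names are hypothetical.

```python
# Sketch of Eq. (1.5) and the PCAC relation. All masses below are assumed
# illustrative values (in GeV), not numbers quoted in the text.
G_A0 = -1.2695        # g_A, as quoted in Eq. (1.5)
M_A = 1.0             # axial mass (assumed placeholder)
M_NUCLEON = 0.9389    # (M_p + M_n) / 2 (assumed)
M_PION = 0.1396       # pion mass (assumed)


def axial_ff(q2: float, m_a: float = M_A) -> float:
    """Dipole axial form-factor G_A(Q^2) = g_A (1 + Q^2 / M_A^2)^(-2)."""
    return G_A0 / (1.0 + q2 / m_a**2) ** 2


def pseudoscalar_ff(q2: float, m_a: float = M_A) -> float:
    """PCAC relation G_P(Q^2) = 4 M^2 G_A(Q^2) / (m_pi^2 + Q^2)."""
    return 4.0 * M_NUCLEON**2 * axial_ff(q2, m_a) / (M_PION**2 + q2)


# A 20% larger axial mass makes |G_A| fall off more slowly with Q^2.
print(axial_ff(1.0, m_a=1.0), axial_ff(1.0, m_a=1.2))
```

At $Q^2 = 0$ the sketch returns $g_A$ by construction, and increasing the assumed axial mass raises $|G_A|$ at any fixed $Q^2 > 0$.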

The precise knowledge of the EM form-factors together with uncertainties is more
important for predicting the NC elastic $\nu N$ reaction cross-section. The structure of the
weak NC hadronic current is similar to (1.3) [?], namely:

\[ J^{\mu}_{NC,p(n)} = \bar{u}(p') \left[ \gamma^{\mu} F_1^{NC,p(n)}(Q^2) + \frac{i \sigma^{\mu\nu} q_{\nu}}{2 M_{p(n)}} F_2^{NC,p(n)}(Q^2) + \gamma^{\mu} \gamma_5 G_A^{NC,p(n)}(Q^2) \right] u(p), \tag{1.6} \]

where

\[ F_{1,2}^{NC,p(n)}(Q^2) = \pm \frac{1}{2} F_{1,2}^{V}(Q^2) - 2 \sin^2\theta_W F_{1,2}^{p(n)}(Q^2) - \frac{1}{2} F_{1,2}^{s}(Q^2), \tag{1.7} \]

\[ G_A^{NC,p(n)}(Q^2) = \pm \frac{1}{2} G_A(Q^2) - \frac{1}{2} G_A^{s}(Q^2). \tag{1.8} \]

$\theta_W$ is the Weinberg angle. $F_{1,2}^{s}(Q^2)$ and $G_A^{s}(Q^2)$ describe the strange content of the
nucleon. We see that the investigation of the elastic NC neutrino-nucleon scattering gives
the opportunity to explore the nucleon strangeness [18, 19] (mainly the axial strange part).
The strangeness of the nucleon is also investigated in elastic ep scattering [19, 20].
The extraction of this contribution is sensitive to the accuracy of the EM form-factors.
Therefore it is necessary to use a well determined FF parametrization together with its
uncertainties.

There are many different phenomenological parametrizations of the EM form-factors
[3, ?]. Some of these are based on theoretical models,
but mostly in practical applications simple functional parametrizations fitted to the data
are applied [30]. The functional form is chosen to satisfy some general properties (proper
behavior at $Q^2 \to 0$ and $Q^2 \to \infty$, scaling behavior). However, a particular choice of the
parametrization determines the final fit and also affects the uncertainty. Form-factors
parameterized by a large number of degrees of freedom have a tendency to describe the
data too accurately, and the generality of the fit is lost. On the other hand, a model
with a small number of parameters may describe the data imprecisely. Moreover, the
complexity of the fit has an impact on its uncertainties.

Searching for the proper parametrization, which describes the data well enough without
losing the generality of the fit, amounts to solving the problem known in statistics as the
bias-variance trade-off [31, 32]. Usually the most reasonable solution is chosen with the use of
common sense, i.e. the fit which leads to a low enough $\chi^2_{min}$ value is accepted, and
more complex models are not considered. The task of this paper is to obtain
model-independent FF parametrizations which are not affected by the problems described
above, i.e. common sense is replaced by an objective Bayesian procedure.

One of the possible fitting techniques is to apply artificial neural networks (ANN).
ANNs have been used in high energy physics for decades [33] and have
been shown to be a powerful tool in the field. Pattern recognition tasks like particle
or interaction identification are efficiently addressed with ANN-based methods also in
present experiments [?]. ANNs are also applied to function approximation and
parameter estimation problems [?].

The ANN techniques have already been applied by the NNPDF collaboration to
represent the nucleon and deuteron EM structure functions [?, ?, 41, 42, 43]. The method
is based on a large collection of networks [?] of the same architecture, trained on
artificial data sets generated from the original experimental measurements. The obtained fits are
claimed to be unbiased because the networks are intentionally oversized: the number of free
parameters (the network weights) is larger than required to solve the problem. To avoid
potential over-fitting (representing the statistical fluctuations of the experimental data),
which may arise under these conditions, the optimization of the network weights (so-called
training) is stopped before reaching the minimum of the figure of merit (error function)
calculated on the training data. The stopping condition is based on the cross-validation technique,
where a portion of the available data is excluded from training. The subset created in this way is
used to calculate the test error function, which starts to increase when the network becomes
fitted to the training data more than to the testing data. This observation is used to stop the
training. The best fit values and the uncertainties given by the NNPDF are computed by
taking the average and standard deviation, respectively, over the set of solutions obtained
from the whole collection of networks.

In the case of the present analysis the number of experimental points varies from 26 to
57, and we do not generate Monte Carlo data. Therefore the cross-validation technique
is unsuitable, because constructing the testing data set can significantly restrict the
information about the underlying data model used in training. Additionally, our intention is to
compare statistical models which are represented by networks of various architectures
and to choose among them the most appropriate parametrization. This motivated us to consider
another idea for finding the best fit and the choice of the neural network architecture. We
apply the Bayesian framework (BF) for the ANN. It is a different philosophy of building the
statistical model than the NNPDF approach. However, both techniques are complementary
and face the same bias-variance trade-off. A pedagogical description of the main
ingredients of both methodologies can be found in Ch. Bishop's book [31] (chapters 9 and
10, respectively).

In the BF approach a sequence of neural networks characterized by different numbers
of hidden units is considered. A network of a particular size has its specific ability to
adjust to the training data, i.e. small networks give smooth approximations, while large
networks can over-fit the data. One can think of a network of a particular architecture as
representing a particular statistical model. With the help of the Bayesian technique we compare the
models and choose the most appropriate one. This method was developed for the
ANN [31, 45, ?] in the nineties of the last century. We adapted this approach for the purpose
of $\chi^2$ minimization. In practice, the so-called evidence is computed for every network type
in order to select the most appropriate parametrization for a given data set. The evidence
is a probabilistic measure which indicates the best solution.

A network of a particular architecture has weights that need to be optimized, i.e. the
global minimum of the error function is searched for. In order to get the solution we
consider various gradient algorithms. However, the training done with these algorithms
can get stuck in a local minimum. Therefore, for a given network architecture a sample of
networks with randomized initial weights is trained to find a single configuration at the
global minimum (this procedure is described in Sec. 2.2). The error function is modified
with a so-called regularization term to improve the generalization ability (to control the
over-fitting); the extent of regularization is controlled in a statistically optimal way, also as a
part of the Bayesian algorithm.

The main results of our studies are unbiased proton and neutron FF parametrizations,
available in numerical form at [47] as well as in analytical form (see Appendix A).
The proposed statistical method also allows one to compute the form-factor uncertainties (from
the covariance matrix). One of the strengths of this methodology is its ability to study
the deviations of the form-factors from the dipole form.

Finally, let us mention that the previous (non-neural) form-factor data analyses
(with ad-hoc parametrizations) have been done in a non-Bayesian spirit, i.e. the authors did
not compare the possible FF parametrizations in order to choose the most suitable one. Usually
one particular functional form was discussed and analyzed within the $\chi^2$ framework.

The paper is organized as follows. In Sec. 2 the feed-forward neural networks are
briefly reviewed. Sec. 3 describes the Bayesian approach to neural networks. The last
section contains the numerical results and discussion. We supplement the article with an
appendix, which presents the fits in analytical form.



2. Feed Forward Neural Networks 



2.1 Multi-Layer Perceptron 

We consider the feed-forward neural network in the so-called multi-layer perceptron (MLP)
configuration. The network structure (shown in Fig. 1) contains: the input layer, the layer
of M hidden neurons and a single neuron in the output layer. We will say that a network
of type 1-M-1 is considered. Each neuron (see Fig. 2) calculates its output value as an
activation function $f_{act}$ of the weighted sum of its inputs:

\[ f_{act}\left( \sum_i w_i \mu_i \right), \tag{2.1} \]

where $w_i$ denotes the weight parameter, while $\mu_i$ represents the output value of the unit
from the previous layer. Neurons in the hidden layer are usually non-linear, with the sigmoid
or hyperbolic tangent functions taken as $f_{act}$; in this analysis the output neuron is a linear
function. In general, the ANN gives a map ($y$) of the input into the output vector spaces.
The overall network response is then a deterministic function of the input variable (vector
$\vec{in}$), and the weight parameters:

\[ y(\vec{in}, w) : \mathbb{R}^{input} \to \mathbb{R}^{output}. \tag{2.2} \]

In our analysis the ANN is expected to approximate the given form-factor $G$ depending on
the input variable $Q^2$:

\[ y(Q^2, w) = G(Q^2). \tag{2.3} \]

Let $\mathcal{D}$ denote the training data set of $N$ points:

\[ \mathcal{D} = \{ (x_1, t_1, \Delta t_1), \ldots, (x_i, t_i, \Delta t_i), \ldots, (x_N, t_N, \Delta t_N) \}, \tag{2.4} \]

where $t_i$ is the measured value of the nucleon form-factor at the point $x_i = Q_i^2$, while
$\Delta t_i$ denotes the total experimental error. The goal of the network training is to find $w$ that
minimizes an error function defined here as:

\[ S(w, \mathcal{D}) = \chi^2(w, \mathcal{D}) + \alpha E_w(w). \tag{2.5} \]

The $\chi^2$ term is the error on data:

\[ \chi^2(w, \mathcal{D}) = \sum_{i=1}^{N} \left( \frac{y(x_i, w) - t_i}{\Delta t_i} \right)^2. \tag{2.6} \]

The parameter $\alpha$ is the factor for the regularization term $E_w$. In this work we apply the weight
decay formula [?]:

\[ E_w(w) = \frac{1}{2} \sum_{i=1}^{W} w_i^2, \tag{2.7} \]

where $W$ denotes the total number of weights in the network (including bias weights).
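The error function of Eqs. (2.5)-(2.7) can be written down in a few lines. The sketch below is illustrative only; the straight-line `line` model and the two data points are hypothetical stand-ins for a network response and real measurements.

```python
def chi2(model, weights, data):
    """Eq. (2.6): sum over data points of ((y(x_i, w) - t_i) / dt_i)^2."""
    return sum(((model(x, weights) - t) / dt) ** 2 for x, t, dt in data)


def weight_decay(weights):
    """Eq. (2.7): E_w = 1/2 * sum_i w_i^2."""
    return 0.5 * sum(w * w for w in weights)


def total_error(model, weights, data, alpha):
    """Eq. (2.5): S = chi^2 + alpha * E_w."""
    return chi2(model, weights, data) + alpha * weight_decay(weights)


def line(x, w):
    """Hypothetical toy model y = w0 + w1 * x (not a neural network)."""
    return w[0] + w[1] * x


# Each data point is (x_i, t_i, dt_i); the numbers are made up.
data = [(0.0, 1.0, 0.1), (1.0, 3.0, 0.2)]
print(total_error(line, [1.0, 2.0], data, alpha=0.01))
```

For weights that reproduce the data exactly, the $\chi^2$ term vanishes and only the penalty $\alpha E_w$ remains, which makes the role of the regularization term explicit.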

In general, the output of the MLP with $M$ hidden neurons and the linear output
neuron can be written in the form:

\[ y(\mu_0, \ldots, \mu_L) = \sum_{m=0}^{M} w_m^{out} f_{act}^{m}\left( \sum_{l=0}^{L} w_{ml}^{hid} \mu_l \right). \tag{2.8} \]

In this paper we consider neural networks with $L = 1$: one input unit $\mu_1 = Q^2$, and
one bias unit $\mu_0 = 1$ in the first layer. The bias of the output neuron in the above formula
is considered as a hidden neuron with constant output, $f_{act}^{0} = 1$. Such a representation
closely corresponds to the Kolmogorov function superposition theorem [48]. Based on this
relation it was shown [50, 51] that the MLP can approximate any continuous function of
its inputs, to an extent that depends on the number of hidden neurons. However,
in the practical problem we face, the desired function is not known and only a
limited number of experimental points is available instead. This leads to the bias-variance
problem mentioned earlier. The output of an oversized network tends to closely approach
the training data points if the weights are not constrained during the training. Usually this
means that statistical fluctuations are captured. The weight regularizing term (Eq. 2.7)
penalizes large weight values and smooths out the network output, but on the other
hand, applying the regularization with an overestimated value of the factor $\alpha$ leads to a fit
which does not reproduce significant features of the training data. The effect of applying
regularization is illustrated in Figs. 3 and 4, where a relatively large network was trained
with various values of the factor $\alpha$. Similarly, a network with a low number of
hidden neurons may not be capable of representing the desired function. Sec. 3 presents the
statistical approach to determine the network size appropriate to the given data set and
to predict the optimal value of $\alpha$.
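As an illustration of the 1-M-1 architecture described above, the following sketch evaluates the response of a network with sigmoid hidden units and a linear output neuron. The argument names and the example weights are hypothetical; the weight count follows the 1-M-1 bookkeeping (M hidden weights, M bias weights and M + 1 linear weights).

```python
import math


def sigmoid(a: float) -> float:
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-a))


def mlp_1_m_1(q2, hidden_w, hidden_b, out_w, out_b):
    """Response of a 1-M-1 network for a single input Q^2.

    hidden_w[m], hidden_b[m]: input weight and bias weight of hidden unit m;
    out_w[m]: linear output weight of hidden unit m; out_b: output bias.
    """
    hidden = [sigmoid(w * q2 + b) for w, b in zip(hidden_w, hidden_b)]
    return out_b + sum(v * h for v, h in zip(out_w, hidden))


def n_weights(m: int) -> int:
    """Total number of weights W of a 1-M-1 network: M + M + (M + 1)."""
    return 3 * m + 1


# Example with M = 2 and made-up weights.
print(mlp_1_m_1(0.0, [1.0, -1.0], [0.0, 0.0], [2.0, 4.0], 0.5), n_weights(2))
```

The output bias plays the role of the constant hidden unit in Eq. (2.8); treating it as a separate argument is only a coding convenience.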

2.2 Training of Network

It has already been mentioned that the training of the network is the process of establishing
$w$ which minimizes the error function (2.5). We denote the minimal error by $S(w_{MP}, \mathcal{D})$
(the notation will become clear later).

The first algorithm for the MLP weight optimization, back-propagation, was proposed by
D. E. Rumelhart et al. in [52]. Currently there is a wide range of gradient descent and
stochastic algorithms available for network training. We use mainly the
Levenberg-Marquardt algorithm [53, 54], since it converges efficiently and does not require precise
parameter tuning. However, we also trained the networks with the quick-prop [55] and rprop
[56] algorithms. The obtained results were very similar.

The algorithms we use, like all gradient-based optimization methods, may suffer from
local minima. Therefore, for a given network type 1-M-1 we consider a large sample of
networks with different (randomized) initial weights. We use a limited range of initial
weight values according to the properties of the neuron activation function^1.

^1 High weight values make the sigmoid activation function very steep. Then the neuron input values
have a very narrow range where the neuron output is not saturated; this would effectively block the
training, where the output derivative is used extensively. Hence we restrict the initial weight range to
$|w_{initial}| \sim f_{act}/(L \bar{\mu})$, where $f_{act}$ is the value for which the activation function saturates, $L$ is the
number of neuron inputs, and $\bar{\mu}$ is the mean neuron input value.

After the training of the sample of networks of the same type, the distribution of the
total error value $S(w_{MP}, \mathcal{D})$ is obtained (see Figs. 5 and 6). Notice that the distribution
starts sharply at a particular $S_{cut}$ value. Such a clear cut on the error value gives us an
indication that the global minimum is well approximated. The number of networks in the
sample required to determine a clear $S_{cut}$ value depends on the complexity of the data.
The typical numbers we obtained were as follows: 150 ($G_{En}$ data), 250 ($G_{Mp}$ data), 700
($G_{Mn}$ data), and 1300 ($G_{Ep}$ data).

The Bayesian framework allows one to choose the best model from the sample. It is the
solution characterized by the highest evidence (as described in Sec. 3). In practice,
if the total error is too big then the evidence is too low and the given network can be
discarded from further analysis. Hence, to simplify the numerical procedure, we take into
consideration the ten fits (neural networks) with the lowest total error values. They are also
characterized by a low $\chi^2$ value, namely $\chi^2(w_{MP}, \mathcal{D})/(N - W) < 1$; $N - W$ is the
number of degrees of freedom. Among them the one with the maximal evidence is selected
for further comparison with the networks of other types. It was interesting to observe that
the fit parametrizations given by the average over the fits selected by the lowest error value
were found to be very similar to those indicated by the highest evidence in each sample.
This observation confirms that all solutions we select from the sample are localized in the close
neighborhood of the global minimum and are very similar to the one indicated by the
highest evidence.
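The multi-start strategy described above can be illustrated with a toy example. The sketch below is not the training code of the analysis: plain gradient descent replaces the Levenberg-Marquardt algorithm, and a one-dimensional error surface with one local and one global minimum (a hypothetical stand-in for $S(w, \mathcal{D})$) replaces the network error.

```python
import random


def gradient_descent(grad, w0, lr=0.01, steps=2000):
    """Plain gradient descent; a simple stand-in for the actual training algorithm."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def error(w):
    """Toy error surface with one local and one global minimum (hypothetical)."""
    return (w * w - 1.0) ** 2 + 0.3 * w


def error_grad(w):
    """Derivative of the toy error surface."""
    return 4.0 * w * (w * w - 1.0) + 0.3


rng = random.Random(0)  # fixed seed so that the sample is reproducible
starts = [rng.uniform(-2.0, 2.0) for _ in range(50)]

# Train one "network" per random start and order the fits by their final error.
fits = sorted((gradient_descent(error_grad, w0) for w0 in starts), key=error)

best_ten = fits[:10]     # candidates that would be passed to the evidence comparison
s_cut = error(fits[0])   # lower edge of the error distribution
print(fits[0], s_cut)
```

Some starts end in the local minimum near $w \approx 1$, but sorting by the final error isolates the fits in the global minimum near $w \approx -1$, which mimics the sharp $S_{cut}$ edge discussed in the text.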



3. Bayesian Approach to Neural Networks 

The Bayesian framework (BF) for the model comparison [44, 45, 57] is taken into
consideration. We adapt this framework for the purpose of $\chi^2$ minimization. The data are
analyzed with a set of neural networks of various types $\mathcal{A}_M$: 1-M-1. A given neural network of
architecture $\mathcal{A}_i$ corresponds to a particular statistical model (hypothesis) describing the data.
The BF allows one to:

• quantitatively classify the hypotheses;

• choose objectively the best model (neural network) for representing a given data set;

• establish objectively the weight decay parameter $\alpha$ (see Eq. 2.7);

• compute the uncertainty for the neural network response (output), and uncertainties
for other network parameters.

The approach naturally embodies the so-called Occam's razor criterion, which
penalizes more complex models and prefers simpler solutions.

3.1 Bayesian Algorithm 

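At the beginning of the fitting procedure every neural network architecture $\mathcal{A}_M$ is classified
by the prior probability $\mathcal{P}(\mathcal{A}_M)$. After the training of the network with the data $\mathcal{D}$, the
posterior probability $\mathcal{P}(\mathcal{A}_M | \mathcal{D})$ is evaluated, i.e. the probability of the model $\mathcal{A}_M$ given the data
$\mathcal{D}$. It quantitatively classifies the considered hypotheses.

On the other hand, applying Bayes' theorem allows one to express the posterior
probability in the following way:

\[ \mathcal{P}(\mathcal{A}_M | \mathcal{D}) = \frac{\mathcal{P}(\mathcal{D} | \mathcal{A}_M) \mathcal{P}(\mathcal{A}_M)}{\mathcal{P}(\mathcal{D})}, \tag{3.1} \]

where

\[ \mathcal{P}(\mathcal{D} | \mathcal{A}_M) \tag{3.2} \]

is called the evidence [?] (the probability of the data $\mathcal{D}$ given $\mathcal{A}_M$).

There is no reason to prefer some particular model before starting the data analysis, hence:

\[ \mathcal{P}(\mathcal{A}_1) = \mathcal{P}(\mathcal{A}_2) = \ldots = \mathcal{P}(\mathcal{A}_M) = \ldots \tag{3.3} \]

Then, if one neglects the normalization factor $\mathcal{P}(\mathcal{D})$, the evidence (3.2) is the probability
distribution which quantitatively classifies the hypotheses.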
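With equal priors (3.3), the posterior probabilities of the competing architectures follow directly from their evidences. The sketch below illustrates this normalization; the log-evidence values are invented for the example and do not come from the fits.

```python
import math


def posterior_from_log_evidence(log_ev):
    """Posterior P(A_M | D) for equal priors, given ln P(D | A_M) for each M."""
    shift = max(log_ev.values())  # subtract the maximum to avoid underflow
    unnorm = {m: math.exp(v - shift) for m, v in log_ev.items()}
    norm = sum(unnorm.values())   # plays the role of P(D) up to a constant
    return {m: u / norm for m, u in unnorm.items()}


# Hypothetical log-evidences for 1-M-1 networks with M = 1..4.
log_ev = {1: -40.2, 2: -31.5, 3: -32.1, 4: -34.8}
post = posterior_from_log_evidence(log_ev)
best_m = max(post, key=post.get)
print(best_m, post[best_m])
```

Because the priors are equal, ranking the models by posterior probability is the same as ranking them by evidence.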
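The evidence is constructed in the so-called hierarchical approach. It is a three-level
procedure. Applying Bayes' theorem, the probability distribution for the weight parameters
is constructed first, then the probability distribution of the decay parameter $\alpha$, and finally
the evidence is evaluated:

\[ \mathcal{P}(w | \mathcal{D}, \alpha, \mathcal{A}_M) = \frac{\mathcal{P}(\mathcal{D} | w, \alpha, \mathcal{A}_M) \mathcal{P}(w | \alpha, \mathcal{A}_M)}{\mathcal{P}(\mathcal{D} | \alpha, \mathcal{A}_M)}, \tag{3.4} \]

\[ \mathcal{P}(\alpha | \mathcal{D}, \mathcal{A}_M) = \frac{\mathcal{P}(\mathcal{D} | \alpha, \mathcal{A}_M) \mathcal{P}(\alpha | \mathcal{A}_M)}{\mathcal{P}(\mathcal{D} | \mathcal{A}_M)}, \tag{3.5} \]

\[ \mathcal{P}(\mathcal{A}_M | \mathcal{D}) = \frac{\mathcal{P}(\mathcal{D} | \mathcal{A}_M) \mathcal{P}(\mathcal{A}_M)}{\mathcal{P}(\mathcal{D})}. \tag{3.6} \]

Below a short description of the Bayesian approach is presented.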
1. Constructing the weight parameter distribution

The probability distribution for the neural network weights is built, assuming that the
regularization parameter $\alpha$ is fixed:

\[ \mathcal{P}(w | \mathcal{D}, \alpha, \mathcal{A}_M) = \frac{\mathcal{P}(\mathcal{D} | w, \alpha, \mathcal{A}_M) \mathcal{P}(w | \alpha, \mathcal{A}_M)}{\mathcal{P}(\mathcal{D} | \alpha, \mathcal{A}_M)}, \tag{3.7} \]

where $\mathcal{P}(w | \alpha, \mathcal{A}_M)$ is the prior probability distribution of the weights, while $\mathcal{P}(\mathcal{D} | w, \alpha, \mathcal{A}_M)$ is
the likelihood function. In the case of the present analysis the likelihood function is given by
the $\chi^2$ function, namely:

\[ \mathcal{P}(\mathcal{D} | w, \alpha, \mathcal{A}_M) = \frac{1}{Z_\chi} \exp\left[ -\chi^2(w, \mathcal{D}) \right], \qquad Z_\chi = \int d^N t \, \exp\left[ -\chi^2(w, \mathcal{D}) \right] = \pi^{N/2} \prod_{i=1}^{N} \Delta t_i. \tag{3.8} \]

The prior probability should be as general as possible. Indeed, there are plenty of
possibilities (e.g. Laplacian or entropy-based priors, see the discussion in Ref. [58]). We assume that
every weight parameter is distributed according to the same Gaussian distribution (with
zero mean and standard deviation $1/\sqrt{\alpha}$):

\[ \mathcal{P}(w | \alpha, \mathcal{A}_M) = \frac{1}{Z_w(\alpha)} \exp\left[ -\alpha E_w \right], \qquad Z_w(\alpha) = \int d^W w \, \exp\left[ -\alpha E_w \right] = \left( \frac{2\pi}{\alpha} \right)^{W/2} \tag{3.9} \]

(the arguments supporting the above choice of the prior are presented in Sec. 3.2). It gives
a probabilistic interpretation for the regularization function $E_w$ defined in the previous
section (see Eq. 2.7). Then we see that:

\[ \mathcal{P}(\mathcal{D} | \alpha, \mathcal{A}_M) = \int d^W w \, \mathcal{P}(\mathcal{D} | w, \alpha, \mathcal{A}_M) \mathcal{P}(w | \alpha, \mathcal{A}_M) = \frac{Z_M(\alpha)}{Z_\chi Z_w(\alpha)}, \tag{3.10} \]

\[ Z_M(\alpha) = \frac{(2\pi)^{W/2}}{\sqrt{|A|}} \exp\left[ -\chi^2(w_{MP}) - \alpha E_w(w_{MP}) \right]. \tag{3.11} \]

The last integral was computed by expanding the error function up to the Hessian term:

\[ S(w, \mathcal{D}) = S(w_{MP}, \mathcal{D}) + \frac{1}{2} (w - w_{MP})^T A (w - w_{MP}), \tag{3.12} \]

where $w_{MP}$ is the vector of weights which minimizes $S(w, \mathcal{D})$ (maximizes the posterior
probability (3.7)).

The Hessian matrix reads

\[ A_{ij} = \nabla_i \nabla_j S \big|_{w = w_{MP}} = \nabla_i \nabla_j \chi^2(w, \mathcal{D}) \big|_{w = w_{MP}} + \alpha \delta_{ij} \tag{3.13} \]

\[ = 2 \sum_{k=1}^{N} \left[ \frac{\nabla_i y(x_k, w_{MP}) \nabla_j y(x_k, w_{MP})}{\Delta t_k^2} + \frac{y(x_k, w_{MP}) - t_k}{\Delta t_k^2} \nabla_i \nabla_j y(x_k, w_{MP}) \right] + \alpha \delta_{ij}. \tag{3.14} \]

We compute the full Hessian matrix [59]. Usually the double differential term in (3.14) is
neglected, which is a good approximation only at the minimum. Taking into account the full
Hessian plays a crucial role in optimizing the $\alpha$ parameter, as will become clear below.

The network response uncertainty $\Delta y$ is defined by the variance:

\[ (\Delta y(x))^2 = \int d^W w \, \left[ y(x, w) - \langle y(x) \rangle \right]^2 \mathcal{P}(w | \mathcal{D}, \alpha, \mathcal{A}_M). \tag{3.15} \]

In the first approximation it is expressed by the covariance matrix, i.e. the inverse of the
Hessian matrix:

\[ (\Delta y(x))^2 = \left( \nabla y(x, w_{MP}) \right)^T A^{-1} \nabla y(x, w_{MP}). \tag{3.16} \]

In Appendix A the covariance matrices obtained for every considered problem are
presented.
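Equation (3.16) is a quadratic form in the gradient of the network response, so the uncertainty can be evaluated directly once the covariance matrix is known. The sketch below assumes the covariance matrix $A^{-1}$ is supplied as a nested list; the function name, the linear toy model and the numbers are hypothetical.

```python
def response_uncertainty(grad_y, covariance):
    """Eq. (3.16): Delta y = sqrt(g^T C g), with C = A^{-1} the covariance matrix.

    grad_y: gradient of the response with respect to the weights at w_MP;
    covariance: W x W covariance matrix given as a list of rows.
    """
    n = len(grad_y)
    variance = sum(grad_y[i] * covariance[i][j] * grad_y[j]
                   for i in range(n) for j in range(n))
    return variance ** 0.5


# Toy model y = w0 + w1 * x, whose gradient with respect to (w0, w1) is (1, x).
x = 2.0
cov = [[0.04, 0.0], [0.0, 0.01]]  # made-up covariance matrix
print(response_uncertainty([1.0, x], cov))
```

For a real network the gradient would be obtained by differentiating the network response with respect to every weight at $w_{MP}$.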

2. Constructing the distribution of the parameter $\alpha$

The $\alpha$ parameter is established by applying the so-called evidence approximation [44, 45,
60], a method which is equivalent to the type II maximum likelihood in conventional statistics.

Bayes' rule leads to:

\[ \mathcal{P}(\alpha | \mathcal{D}, \mathcal{A}_M) = \frac{\mathcal{P}(\mathcal{D} | \alpha, \mathcal{A}_M) \mathcal{P}(\alpha | \mathcal{A}_M)}{\mathcal{P}(\mathcal{D} | \mathcal{A}_M)}, \tag{3.17} \]

where $\mathcal{P}(\mathcal{D} | \alpha, \mathcal{A}_M)$ has been obtained in the previous section (see Eq. 3.10).

We are searching for the $\alpha_{MP}$ parameter, i.e. the one which maximizes the posterior
probability (3.17). It can be shown that in the Hessian approximation it is given by the
solution of the equation:

\[ 2 \alpha_{MP} E_w(w_{MP}) = \sum_{i=1}^{W} \frac{\lambda_i}{\lambda_i + \alpha_{MP}} \equiv \gamma, \tag{3.18} \]

where the $\lambda_i$'s are the eigenvalues of the matrix $\nabla_n \nabla_m \chi^2 |_{w = w_{MP}}$. In practice, the eigenvalues
depend on $\alpha$; therefore, to get a proper $\alpha_{MP}$ the $\alpha$ parameter is iteratively changed during
the training process, i.e.:

\[ \alpha_{k+1} = \gamma(\alpha_k) / 2 E_w(w). \tag{3.19} \]

The iteration procedure fixes the $\alpha$ parameter in the optimal way. The typical dependence
of $\alpha_k$ on the iteration step is presented in Fig. [?]. In Sec. 3.2 it is shown that the choice of
the initial $\alpha$ value has a small impact on the final results.
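The fixed-point iteration (3.19) can be illustrated in isolation. In the sketch below the eigenvalues and $E_w$ are frozen, made-up numbers, whereas in the actual training both change as the weights are updated, so this shows only the update rule itself.

```python
def gamma(eigenvalues, alpha):
    """Eq. (3.18): gamma = sum_i lambda_i / (lambda_i + alpha)."""
    return sum(lam / (lam + alpha) for lam in eigenvalues)


def update_alpha(eigenvalues, alpha, e_w):
    """Eq. (3.19): alpha_{k+1} = gamma(alpha_k) / (2 E_w)."""
    return gamma(eigenvalues, alpha) / (2.0 * e_w)


# Hypothetical, fixed eigenvalues of the chi^2 Hessian and a fixed E_w.
eigs = [50.0, 5.0, 0.1]
e_w = 2.0

alpha = 0.001          # initial value alpha_0
history = [alpha]
for _ in range(50):
    alpha = update_alpha(eigs, alpha, e_w)
    history.append(alpha)

print(alpha, gamma(eigs, alpha))
```

At convergence the returned value satisfies $2 \alpha E_w = \gamma(\alpha)$, which is condition (3.18) for the toy inputs.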



At the end of the training procedure one can approximate (3.10) as follows:

\[ \mathcal{P}(\mathcal{D} | \ln\alpha, \mathcal{A}_M) = \mathcal{P}(\mathcal{D} | \ln\alpha_{MP}, \mathcal{A}_M) \exp\left[ -\frac{(\ln\alpha - \ln\alpha_{MP})^2}{2 \sigma^2_{\ln\alpha}} \right], \tag{3.20} \]

where in the Hessian approximation $\sigma^2_{\ln\alpha} \approx 2/\gamma$.

3. Constructing the evidence

The evidence for a given model is defined by the denominator of (3.17). If one assumes a
uniform prior distribution of the $\ln\alpha$ parameter^2 on some large $\ln\Omega$ region, then the evidence
can be approximated by:

\[ \mathcal{P}(\mathcal{D} | \mathcal{A}_M) \approx \mathcal{P}(\mathcal{D} | \alpha_{MP}, \mathcal{A}_M) \frac{\sqrt{2\pi} \, \sigma_{\ln\alpha}}{\ln\Omega}. \tag{3.21} \]

The $\ln\Omega$ is a constant which is the same for all the hypotheses.

The logarithm of the evidence (we show only the model-dependent terms) reads

\[ \ln \mathcal{P}(\mathcal{D} | \mathcal{A}_M) \approx -\chi^2(w_{MP}) - \alpha_{MP} E_w(w_{MP}) - \frac{1}{2} \ln|A| + \frac{W}{2} \ln\alpha_{MP} - \frac{1}{2} \ln\frac{\gamma}{2}. \tag{3.22} \]

The first term in the above expression, $-\chi^2(w_{MP})$ (usually of low value for simple models),
is the misfit of the approximated data, while the next four terms constitute the so-called
Occam factor, which penalizes complex models. Since in this work we consider only
networks of type 1-M-1 (only one hidden layer), in the rest of the paper we will denote
the evidence $\mathcal{P}(\mathcal{D} | \mathcal{A}_M)$ by $\mathcal{P}(\mathcal{D} | M)$.
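The terms of the log-evidence can be assembled as in the sketch below. It assumes the five-term form: misfit, weight penalty, $-\frac{1}{2}\ln|A|$, $\frac{W}{2}\ln\alpha_{MP}$ and $-\frac{1}{2}\ln(\gamma/2)$. Because $A$ equals the $\chi^2$ Hessian plus $\alpha$ times the unit matrix, its eigenvalues are $\lambda_i + \alpha_{MP}$, so only the eigenvalues $\lambda_i$ are needed. The function name and the inputs are hypothetical.

```python
import math


def log_evidence(chi2_mp, e_w_mp, alpha_mp, eigenvalues):
    """Model-dependent part of ln P(D | A_M), assuming the five-term form.

    eigenvalues: eigenvalues lambda_i of the chi^2 Hessian at w_MP; the
    eigenvalues of A are then lambda_i + alpha_MP.
    """
    n_w = len(eigenvalues)  # total number of weights W
    gamma = sum(lam / (lam + alpha_mp) for lam in eigenvalues)
    log_det_a = sum(math.log(lam + alpha_mp) for lam in eigenvalues)
    misfit = -chi2_mp
    occam = (-alpha_mp * e_w_mp - 0.5 * log_det_a
             + 0.5 * n_w * math.log(alpha_mp) - 0.5 * math.log(gamma / 2.0))
    return misfit + occam


# Made-up inputs: two weights, unit eigenvalues, alpha_MP = 1.
print(log_evidence(chi2_mp=10.0, e_w_mp=2.0, alpha_mp=1.0, eigenvalues=[1.0, 1.0]))
```

Splitting the result into `misfit` and `occam` mirrors the discussion in the text: a larger network usually lowers the misfit but pays for it through the Occam factor.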

3.2 Prior Function 

We have already mentioned that various possible prior distributions are considered in
the literature [?]. In this analysis the likelihood function is given by the $\chi^2$ distribution, which
has a Gaussian probabilistic interpretation. Therefore it seems reasonable to assume
that the distribution of the weight parameters should also be described by a Gaussian-like prior
function. Additionally we assume, without loss of generality, that:

• negative and positive values of the weight parameters are equally likely;

• at the beginning of the learning procedure the weight parameters are independent;

• small^3 weight values are more likely than large values.

^2 It is the consequence of the fact that $\alpha$ is the scale parameter.

^3 For the networks with sigmoid activation functions the non-trivial smooth functional
parametrizations are described by low $|w_i|$ weights.
Then the Gaussian-like prior distribution can have the form:

\[ \mathcal{P}(w | \vec{\alpha}, \mathcal{A}_M) \sim \exp\left[ -\frac{1}{2} \sum_{i=1}^{W} \alpha_i w_i^2 \right]. \tag{3.23} \]

Notice that every $w_i$ parameter has its own regularization parameter. As was mentioned
in the previous section, $\alpha$ is the so-called scale parameter. The number of
scale parameters can be reduced if the symmetry property of the given network architecture
is taken into account. The network of type 1-M-1 has: M hidden weights, M corresponding
bias weights and M + 1 linear weights (output weights plus one bias parameter).
A permutation of the hidden units does not change the network functional type.
Permuting two hidden units is realized by an exchange between the weight parameters of the
same type (hidden, bias and linear weights). This symmetry property allows us to reduce
the number of $\alpha$'s to three independent scale parameters:

• $\alpha_h$ for the hidden weights;

• $\alpha_b$ for the bias weights (in the hidden layer);

• $\alpha_l$ for the linear weights in the output layer.

Then the prior function reads

\[ \mathcal{P}(w | \vec{\alpha}, \mathcal{A}_M) \sim \exp\left[ -\frac{1}{2} \left( \alpha_h \sum_{i \in hidden} w_i^2 + \alpha_b \sum_{i \in bias} w_i^2 + \alpha_l \sum_{i \in linear} w_i^2 \right) \right]. \tag{3.24} \]



We made an effort to compare the results obtained with both the (3.9) and (3.24)
priors. It was observed that the final results are very similar. Analogously to the case
of the (3.9) prior, the $\alpha_h$, $\alpha_b$ and $\alpha_l$ parameters were iteratively changed during the training
procedure. The typical results, obtained for the $G_{Mn}/\mu_n G_D$ and $G_{En}$ data sets, are shown
in Fig. [?]. The differences between the final best fits are negligible. In the left column of
the same figure we plot the dependence of $S(w, \mathcal{D})$ on the iteration step. We see that
the minimal value of the total error is almost the same for both prior functions. In both
cases the training started from the same initial weight configuration.

All the above seems to justify the simplest choice of the prior function, namely the one given
by Eq. 3.9. Nevertheless, it may happen that for more complex data than we discuss, the
results will significantly depend on the prior assumptions. In such a case the Bayesian framework
can be used to indicate the best prior function.

Finally, we discuss the dependence of the final results on the initial $\alpha_0$ value. We
considered several initial values of $\alpha_0$ (see Table 1). After training we noticed that the
choice of the initial $\alpha_0$ had a small impact on the final $\alpha_{MP}$ value (see Fig. 8) as well as on the
fits. This is shown in Table 1, where the relative distances, in the weight space, between the fits
are presented. Notice that only the solution computed for $\alpha_0 = 1$ lies apart from the others.

It is worth mentioning that decreasing the $\alpha_0$ parameter can be understood as enlarging
the effective prior domain. For the final analysis we set $\alpha_0 = 0.001$.



alpha_0     0.0001     0.001      0.01       0.1        1
0.0001      -          0.0925     0.0196     0.8048     14.8847
0.0010      0.0925     -          0.0748     0.8921     14.9641
0.0100      0.0196     0.0748     -          0.8214     14.8965
0.1000      0.8048     0.8921     0.8214     -          14.2768
1.0000      14.8847    14.9641    14.8965    14.2768    -

Table 1: The distance $d(w_1, w_2) = \sqrt{\sum_i (w_{1i} - w_{2i})^2}$ between fits obtained for various initial
$\alpha_0$ values. The computations are done for the $G_{En}$ data for the network of 1-2-1 type.
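The distance used in Table 1 is the Euclidean distance between two weight vectors. A minimal sketch follows, with made-up weight vectors rather than the fitted ones:

```python
import math


def weight_distance(w1, w2):
    """d(w1, w2) = sqrt(sum_i (w1_i - w2_i)^2) between two fits of the same network type."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w2)))


# Made-up weight vectors of a 1-2-1 network (7 weights each).
fit_a = [0.5, -1.2, 0.3, 0.8, 1.1, -0.4, 0.2]
fit_b = [0.5, -1.2, 0.3, 0.8, 1.1, -0.4, 0.7]
print(weight_distance(fit_a, fit_b))
```

The distance is symmetric in its arguments, which is why the table is symmetric about its diagonal.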



In this section we have demonstrated that our results weakly depend on the prior as- 
sumptions. It has been also shown that it is relatively easy to construct the prior function if 
the symmetry properties of network are taken into consideration. Usually, it is not the case 
in the conventional form-factor data analysis, where the ad-hoc parametrizations are dis- 
cussed. The typical phenomenological parametrization has no straightforward symmetries. 
As an example consider the function |25, 62 1: 

""^"^ ^~ bo + 6iQ2 + b2Q^ + 63Q6 + b^gs • l-^-^^^ 

Constructing the prior function for the above form-factor parametrization seems to be more complicated than in the ANN case. One can postulate the values of the ratios $a_0/b_0$ and $a_2/b_4$, which describe the low- and high-$Q^2$ behavior of the FF. However, the remaining parameters, which seem to model the intermediate $Q^2$ region, can take arbitrary values. Therefore building the prior distribution for the above FF would require extra phenomenological and theoretical knowledge.
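The role of the two ratios can be checked numerically. The sketch below assumes that the parametrization is a ratio of polynomials in $Q^2$ with numerator coefficients $a_0, a_1, a_2$ and denominator coefficients $b_0, \ldots, b_4$; the coefficient values are hypothetical and chosen only for illustration.

```python
def rational_ff(q2, a, b):
    """Ratio of two polynomials in Q^2 with coefficient lists a and b."""
    num = sum(ak * q2 ** k for k, ak in enumerate(a))
    den = sum(bk * q2 ** k for k, bk in enumerate(b))
    return num / den


# Hypothetical coefficients, for illustration only.
a = [1.0, 0.5, 0.2]            # a0, a1, a2
b = [1.0, 3.0, 2.0, 0.5, 0.1]  # b0, ..., b4

# Low-Q^2 limit: G(0) = a0 / b0.
low = rational_ff(0.0, a, b)

# High-Q^2 limit: Q^4 * G(Q^2) tends to a2 / b4.
q2_large = 1.0e6
high = rational_ff(q2_large, a, b) * q2_large ** 2
```

Only `a[0]/b[0]` and `a[2]/b[4]` are fixed by the two limits; the remaining six coefficients are left unconstrained, which is the difficulty described above.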



4. Form-Factor Fits 



4.1 Data 



We consider the electric and magnetic proton and neutron form-factor data. The electric and magnetic nucleon form-factors are defined as follows:

$$G_{Mp,n}(Q^2) = F_1^{p,n}(Q^2) + F_2^{p,n}(Q^2), \tag{4.1}$$

$$G_{Ep,n}(Q^2) = F_1^{p,n}(Q^2) - \frac{Q^2}{4 M_{p,n}^2} F_2^{p,n}(Q^2), \tag{4.2}$$

where:

$$G_{Mp,n}(0) = \mu_{p,n}, \quad G_{Ep}(0) = 1, \quad G_{En}(0) = 0. \tag{4.3}$$

The experimental data is usually normalized to the dipole form-factor $G_D = 1/(1 + Q^2/M_V^2)^2$, with the standard value $M_V^2 = 0.71$ GeV$^2$. The electric $G_{Ep}$ and magnetic $G_{Mp}$ proton FF data have been obtained via the Rosenbluth separation technique from elastic $ep$ scattering [33]. Additionally, since the beginning of the 1990s, measurements of the form-factor ratio $\mu_p G_{Ep}/G_{Mp}$ in spin-dependent elastic $ep$ scattering have been performed [64].
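The relation between the Dirac and Pauli form-factors and the electric and magnetic ones, together with the dipole normalization, can be sketched in a few lines of Python. The sketch assumes the standard Sachs combinations, $G_M = F_1 + F_2$ and $G_E = F_1 - (Q^2/4M^2) F_2$, and the conventional dipole mass parameter 0.71 GeV$^2$; the function names and the rounded proton constants are illustrative assumptions.

```python
def sachs_form_factors(f1, f2, q2, mass):
    """Return (G_E, G_M) from the Dirac (f1) and Pauli (f2) form-factors.

    q2 is Q^2 in GeV^2 and mass is the nucleon mass in GeV.
    """
    tau = q2 / (4.0 * mass ** 2)
    return f1 - tau * f2, f1 + f2


def dipole(q2, mv2=0.71):
    """Dipole form-factor G_D; mv2 = 0.71 GeV^2 is the conventional value."""
    return 1.0 / (1.0 + q2 / mv2) ** 2


MU_P = 2.793  # proton magnetic moment in nuclear magnetons (rounded)
M_P = 0.938   # proton mass in GeV (rounded)

# Normalization at Q^2 = 0: F1 = 1 and F2 = mu_p - 1 for the proton.
g_e0, g_m0 = sachs_form_factors(1.0, MU_P - 1.0, 0.0, M_P)
```

At $Q^2 = 0$ the sketch returns $G_{Ep} = 1$ and $G_{Mp} = \mu_p$, so the normalized quantities $G_{Ep}/G_D$ and $G_{Mp}/\mu_p G_D$ both equal one there.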
It turned out that a systematic discrepancy exists between the so-called Rosenbluth and polarization transfer $\mu_p G_{Ep}/G_{Mp}$ ratio data. The difference can be explained when the two-photon exchange (TPE) effect [65] is taken into account (for a review see [?]). Hence, a proper fit of the EM form-factors requires taking the TPE correction into account [?]. In this work we consider the re-analyzed (TPE-corrected Rosenbluth) $G_{Mp}/\mu_p G_D$ and $G_{Ep}/G_D$ data (Tabs. 2 and 3 of Ref. [62]). However, to see the TPE effect we also consider the original (here called old Rosenbluth) $G_{Mp}/\mu_p G_D$ [?, 68] and $G_{Ep}/G_D$ [?] data sets.

The neutron form-factor data ($G_{En}$ and $G_{Mn}$) are obtained from electron scattering off light nuclei (deuteron [?], helium [?]). Because of the complexity of the nuclear target, extracting the nucleon form-factors is more demanding than in the case of elastic $ep$ scattering: the ground and final states of the nucleon must be properly described. In this analysis we consider the same $G_{En}$ and $G_{Mn}/\mu_n G_D$ data sets as in Ref. [30].



$Q^2$ (GeV$^2$)   error of point ($\Delta t$)   $G_{Mn}/\mu_n G_D$   error
0                 1.0000                        1.01809              0.03258
0                 0.1000                        1.01633              0.03097
0                 0.0100                        1.00198              0.00947
0                 0.0010                        1.00002              0.00075
0                 0.0001                        1.00000              0.00009
0.1               1.0000                        0.96920              0.00700
0.1               0.1000                        0.96893              0.00696
0.1               0.0100                        0.96709              0.00642
0.1               0.0010                        0.96650              0.00624
0.1               0.0001                        0.96986              0.00599
1.0               1.0000                        1.03657              0.00792
1.0               0.1000                        1.03669              0.00797
1.0               0.0100                        1.03720              0.00815
1.0               0.0010                        1.03778              0.00813
1.0               0.0001                        1.03575              0.00773

Table 2: Dependence of $G_{Mn}/\mu_n G_D$ and its uncertainty (computed for $Q^2 = 0$, 0.1, and 1 GeV$^2$) on the $\Delta t$ of the artificial point added at $Q^2 = 0$.



Let us mention that to obtain proper fits of the form-factors at $Q^2 = 0$ we added to every data set one artificial point, namely ($Q^2 = 0$, $t = 1$, $\Delta t = 0.001$) for the $G_{Mn}/\mu_n G_D$, $G_{Mp}/\mu_p G_D$ and $G_{Ep}/G_D$ data sets, and ($Q^2 = 0$, $t = 0$, $\Delta t = 0.001$) for the $G_{En}$ data set. These constraints affect the final fit value and its uncertainty only in the close vicinity of the added point, as shown in Table 2, where we present how the best fit values and their uncertainties depend on the error of the artificial point. We present results for the $G_{Mn}$ data, but for the other data sets we reached analogous conclusions. The $\Delta t$ value assigned to the additional point should be comparable to the data uncertainties used in the network training. We have found that using $\Delta t = 0.01$ and higher is not sufficient to attract the fit to the desired value at the constraint point, while $\Delta t = 0.0001$ causes numerical difficulties during the training, since the point then gives the dominant contribution to the overall network error value.
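The effect of the artificial point can be illustrated with a short Python sketch. It assumes that the data part of the network error is a sum of squared residuals weighted by $1/\Delta t^2$; the helper names and the two example measurements are hypothetical.

```python
def add_constraint_point(data, t0, dt=0.001):
    """Return a new data list with the artificial point (Q^2 = 0, t0, dt) added."""
    return [(0.0, t0, dt)] + list(data)


def chi2(data, model):
    """Sum of squared, error-weighted residuals over (Q^2, t, dt) points."""
    return sum(((t - model(q2)) / dt) ** 2 for q2, t, dt in data)


# Hypothetical measurements of a normalized form-factor: (Q^2, t, dt).
measurements = [(0.1, 0.97, 0.01), (1.0, 1.04, 0.01)]
constrained = add_constraint_point(measurements, t0=1.0)

# A model that misses the constraint by 0.01 is penalized by (0.01/0.001)^2 = 100.
penalty = chi2(constrained[:1], lambda q2: 1.01)
```

Because the penalty scales as $1/\Delta t^2$, reducing $\Delta t$ by a factor of ten multiplies the weight of the artificial point by one hundred, which is consistent with the numerical difficulties reported for $\Delta t = 0.0001$.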

4.2 Numerical Procedure 

The numerical analysis was done with two independent neural network software packages (in order to cross-check the results): one written by R.S. and P.P. [?] and another developed by K.M.G. [?].

(Footnote to Sec. 4.1: we used the JLab data-base.)

The procedure for finding the best neural network model for each data set consists of five major steps:

1. a sequence of networks of different 1-M-1 types (1-1-1, 1-2-1, 1-3-1, etc.) is taken into consideration;

2. for each network of 1-M-1 type a sample of networks with randomly initialized weights is trained;

3. among the networks obtained in the previous step, the ten networks with the lowest total error are selected for further analysis, see Sec. 2.2 and the $S(w_{MP}, \mathcal{D})$ distributions shown in Figs. 5 and 6;

4. the network (from the step above) with the highest evidence is chosen as the best fit candidate for the given network type;

5. the best fits obtained for every network type are compared; the one with the highest evidence is chosen to represent the data.
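Steps 3 to 5 of the procedure above amount to a simple selection over the trained sample. The following Python sketch shows the logic on a toy sample; the record layout (`M`, `error`, `log_evidence`) and the numbers are hypothetical and serve only as an illustration.

```python
def select_best_network(results, n_keep=10):
    """Select the best fit per network type and the overall best fit.

    results: list of dicts with keys 'M' (hidden units), 'error' (total
    error) and 'log_evidence' (ln of evidence) for each trained network.
    """
    best_per_type = {}
    for m in sorted({r["M"] for r in results}):
        sample = [r for r in results if r["M"] == m]
        # Step 3: keep the n_keep networks with the lowest total error.
        lowest_error = sorted(sample, key=lambda r: r["error"])[:n_keep]
        # Step 4: among them, take the network with the highest evidence.
        best_per_type[m] = max(lowest_error, key=lambda r: r["log_evidence"])
    # Step 5: compare network types and keep the highest evidence.
    overall = max(best_per_type.values(), key=lambda r: r["log_evidence"])
    return best_per_type, overall


# Toy sample, for illustration only.
results = [
    {"M": 1, "error": 50.0, "log_evidence": -30.0},
    {"M": 1, "error": 48.0, "log_evidence": -29.0},
    {"M": 2, "error": 20.0, "log_evidence": -12.0},
    {"M": 2, "error": 21.0, "log_evidence": -10.0},
    {"M": 2, "error": 40.0, "log_evidence": -5.0},
    {"M": 3, "error": 19.0, "log_evidence": -15.0},
]
best_per_type, overall = select_best_network(results, n_keep=2)
```

In the toy sample the $M = 2$ network with the largest evidence is discarded in step 3 because its total error is too high, which shows why the error cut is applied before the evidence comparison.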

Let us recall that in the second step a large number (from 150 to 1300) of networks in the sample is considered (as explained in Sec. 2.2) in order to find the solutions which maximize the posterior probability for the given model.

The procedure for training a single network is as follows (see Fig. 9):

• initialize the network weights as small random values;

• initialize the regularization factor (Eq. 2.7); in this analysis $\alpha_0 = 0.001$;

• perform the network training iterations, according to the Levenberg-Marquardt, quickprop, or rprop algorithms;

• calculate the updated regularization factor $\alpha_{k+1}$ (Eq. 3.18) every 20 iterations of the training algorithm; eigenvalues of the Hessian matrix below 10~^ are rejected from the evaluation of $\gamma(\alpha_k)$ (Eq. 3.18);

• calculate the network output (Eq. 2.3) and uncertainty (Eq. 3.16) values for the given range of $Q^2$ values;

• calculate the ln of evidence (Eq. 3.22).
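The regularization update in the list above can be sketched as follows. Eq. 3.18 is not reproduced in this section, so the sketch assumes the standard form of the evidence approximation: $\gamma(\alpha) = \sum_i \lambda_i/(\lambda_i + \alpha)$, with $\lambda_i$ the Hessian eigenvalues, and $\alpha_{k+1} = \gamma(\alpha_k)/(2 E_w)$ with $E_w = \frac{1}{2}\sum_i w_i^2$. The function names and the cut-off value `1e-6` are hypothetical.

```python
def effective_parameters(eigenvalues, alpha, cutoff=1e-6):
    """gamma(alpha): effective number of well-determined parameters.

    Eigenvalues below the cut-off are rejected before the sum is taken.
    The default cut-off is an assumed value, not the one used in the paper.
    """
    kept = [lam for lam in eigenvalues if lam >= cutoff]
    return sum(lam / (lam + alpha) for lam in kept)


def update_alpha(eigenvalues, weights, alpha, cutoff=1e-6):
    """One update step alpha_k -> alpha_{k+1} (assumed standard form)."""
    gamma = effective_parameters(eigenvalues, alpha, cutoff)
    e_w = 0.5 * sum(w * w for w in weights)
    return gamma / (2.0 * e_w)


# Toy numbers: one well-determined direction and one rejected eigenvalue.
alpha_next = update_alpha([1.0, 1e-9], weights=[1.0, 1.0], alpha=1.0)
```

In a training loop this update would be called once every 20 iterations, with the eigenvalues recomputed at the current weights.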

Finally, we briefly highlight the major differences between the NNPDF approach and the one presented in this article.

In this work we consider a sequence of networks with a graded number of hidden units, and the best solution is chosen with the help of the Bayesian framework. The NNPDF group considers one particular network architecture (2-5-3-1 type) to fit the data [41], although some discussion of the dependence of the results on the network architecture is presented.

The NNPDF group prepares a sample of networks. Each network from the sample is trained with artificial data, Monte Carlo generated from the original measurements. The best fit and its uncertainty are then obtained as the average and the standard deviation computed over the sample. In this work every network is always trained with the original data set. A large sample of networks of a given type is nevertheless prepared, but in order to find the architecture and the weights which maximize the evidence. The network response uncertainty is computed from the covariance matrix (Eq. 3.16).

Both approaches deal with the over-fitting problem, but in different ways. The NNPDF group applies early stopping in the training (a cross-validation algorithm is imposed), whereas we consider a regularization penalty term in the error function, which is optimized by the Bayesian procedure. Hence the approach we apply does not require validation of the solutions by comparison with a test data set.

4.3 Numerical Results 

The numerical procedures described in the previous section were applied to all six data sets. We consider networks with $M = 1, \ldots, 5$ hidden units for the $G_{Mn}$, $G_{En}$, and $G_{Ep}$ data and with $M = 1, \ldots, 6$ for the $G_{Mp}$ data. The evidence quantitatively classifies the networks, i.e. the most suitable network architecture for representing the data is indicated by the maximum of the evidence. Notice that the optimal way to deal with these results would be to take an average over all solutions weighted by the evidence. However, in all problems considered here we obtained a clear signal (a peak in the evidence) for a particular solution. This allowed us to neglect the contribution from networks of other sizes.
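The evidence-weighted average mentioned above can be sketched in Python. The sketch assumes equal prior probabilities for all network types, so that the weights are proportional to the evidence; the function names are hypothetical. The largest ln of evidence is subtracted before exponentiation to avoid numerical underflow.

```python
import math


def evidence_weights(log_evidences):
    """Normalized weights proportional to exp(ln evidence)."""
    shift = max(log_evidences)
    unnormalized = [math.exp(le - shift) for le in log_evidences]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def evidence_weighted_average(predictions, log_evidences):
    """Average of model predictions weighted by the model evidence."""
    weights = evidence_weights(log_evidences)
    return sum(w * p for w, p in zip(weights, predictions))


# Two equally probable models give the plain mean of their predictions.
mean_equal = evidence_weighted_average([1.0, 3.0], [-10.0, -10.0])

# A clear peak in the evidence makes the other model negligible.
peaked = evidence_weights([0.0, -20.0])
```

With a difference of 20 in the ln of evidence the second model carries a weight of about $2 \times 10^{-9}$, which illustrates why a clear peak allows the other network sizes to be neglected.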

We start the presentation of the numerical results with the discussion of the $G_{Mn}/\mu_n G_D$ FF data. As described above, we consider a set of networks which differ in the number of hidden units $M$. In Fig. 10 we show the scatter plot presenting the total error versus the log of evidence for each network size. One can notice that the 1-2-1 and 1-3-1 networks have the highest evidences, but the networks with $M = 2$ hidden units are not able to reproduce as low a total error as the 1-3-1 networks. It is also interesting to mention that for $M > 3$ the total error varies slowly, i.e. increasing the number of hidden units lowers the total error only by a minor amount. A clear indication for the 1-3-1 network type is seen in Fig. 11, where only the dependence of $\ln \mathcal{P}(\mathcal{D}|M)$ on $M$ is shown. In this figure we plot the maximal evidences obtained for a given network type. However, in order to control the stability of the numerical procedure we also plot the ln of evidence averaged over the networks around the global minimum (solutions selected in step 3, Sec. 4.2), as well as the ln of the minimal values of $\mathcal{P}(\mathcal{D}|M)$.

Altogether, these results suggest the network of type 1-3-1 (with the highest evidence) as the best fit of the $G_{Mn}$ data. The network output is drawn in Fig. 12 together with the experimental data. The neural network response uncertainty is computed with expression (3.16) and shown in Fig. 13. In Fig. 12 we also plot the best fits obtained for the networks 1-1-1, 1-2-1, 1-4-1 and 1-5-1. As could be expected, increasing the number of hidden units makes the fit more flexible.

The electric neutron FF data ($G_{En}$) are analyzed in the same way as the magnetic neutron data. In Figs. 14, 15 and 16 the plots of the evidence and of the $G_{En}$ form-factor are shown. For $M = 2$ we obtained the peak of the Occam's hill, which indicates the 1-2-1 network architecture as the most representative parametrization.



The results for the proton electric and magnetic FF data are presented in Figs. 17, 18 and 21, 22 (scatter and evidence plots) and Figs. 19 and 20 (form-factor plots). The network of type 1-3-1 is preferred by both the electric and magnetic data sets. As mentioned above, we also analyzed the old form-factor data, which are not TPE corrected. We found that the old $G_{Ep}$ data prefer a representation by the network of type 1-1-1. Hence, the fit of the old Rosenbluth $G_{Ep}/G_D$ data is an almost linear, nearly constant function of $Q^2$. The data seem not to be conclusive enough, so the Bayesian procedure leads to the simplest possible solution. On the other hand, this means that the old proton electric data do not show a clear indication of a deviation from the dipole form.

4.4 Summary 

We have analyzed the form-factor data by means of artificial neural networks. The Bayesian approach has been adapted for the minimization and then applied to the data analysis. For every form-factor data set a sequence of neural networks has been considered. The Bayesian approach provided us with an objective criterion for choosing the most suitable form-factor parametrization (neural network), with the statistically optimal balance between the fit complexity and its uncertainty. Therefore the resulting fits are unbiased and model independent. It has also been demonstrated that the final results depend only weakly on the prior assumptions.

The approach allowed us to investigate objectively the non-dipole deviations of the form-factors. It is interesting to mention that the $G_{Ep}/G_D$, $G_{Mp}/\mu_p G_D$ as well as $G_{Mn}/\mu_n G_D$ form-factor data prefer the same network type (size), 1-3-1. The form-factor parametrizations obtained in this analysis can be easily applied in any phenomenological or experimental analysis. Additionally, a part of our software used in the analysis is available at [?, ?].

The presented method seems to be a promising statistical framework for studying and representing experimental data, especially if the theoretical predictions are not able to reproduce the measurements with the desired accuracy but the experimental data are sufficiently comprehensive to describe the physical quantity by themselves.
A. Analytical Formulae

Two parametrizations of the form-factors have been obtained: the network of type 1-2-1, representing $G_{En}$,

$$G_{En}(Q^2) = w_5 f_{act}(Q^2 w_1 + w_2) + w_6 f_{act}(Q^2 w_3 + w_4) + w_7, \tag{A.1}$$

and the network of type 1-3-1, representing $G_{Mn}/\mu_n G_D$, $G_{Ep}/G_D$ and $G_{Mp}/\mu_p G_D$,

$$G_f(Q^2)/g G_D = w_7 f_{act}(Q^2 w_1 + w_2) + w_8 f_{act}(Q^2 w_3 + w_4) + w_9 f_{act}(Q^2 w_5 + w_6) + w_{10}, \qquad f = Mn, Ep, Mp, \tag{A.2}$$

where $g = 1$ for the proton electric form-factor and $g = \mu_{p,n}$ for the proton, neutron magnetic form-factors. The activation function reads

$$f_{act}(x) = \frac{1}{1 + \exp(-x)}. \tag{A.3}$$

The weights obtained for $G_{En}$:

$$w_{MP} = (10.19704, 2.36812, -1.144266, -4.274101, 0.8149924, 2.985524, -0.7864434) \tag{A.4}$$



with the covariance matrix:

$$\begin{pmatrix}
77182.936 & -76674.953 & 11320.149 & -976.911 & -59149.683 & -510.459 & 59023.698 \\
-141838.399 & 158041.683 & -17763.896 & 1808.806 & 121039.907 & 875.737 & -120845.155 \\
1007.74 & 1987.396 & 2153.904 & 94.542 & 1216.369 & 99.514 & -1244.23 \\
-881.971 & 1138.086 & 154.164 & 2326.543 & 841.299 & -6673.131 & -844.274 \\
-106233.936 & 117881.486 & -13555.199 & 1345.44 & 90326.25 & 660.27 & -90176.259 \\
-524.981 & -282.967 & -492.929 & -6713.68 & -132.119 & 19769.326 & 138.861 \\
106231.986 & -117915.687 & 13528.692 & -1347.504 & -90347.707 & -661.274 & 90199.073
\end{pmatrix} \tag{A.5}$$



The weights obtained for $G_{Mn}/\mu_n G_D$:

$$w_{MP} = (3.19646, 2.565681, 6.441526, -2.004055, -0.2972361, 3.606737, -3.135199, 0.299523, 1.261638, 2.64747) \tag{A.6}$$

with the covariance matrix:

$$\begin{pmatrix}
13019.47 & 5437.135 & 1625.832 & 2407.977 & 2421.111 & -9226.711 & -5508.625 & -466.761 & 11858.122 & -5018.992 \\
-110.632 & 2389.64 & -1419.064 & 1007.869 & 68.926 & 748.578 & -8134.262 & 132.414 & 1320.643 & 6692.985 \\
1186.146 & 1096.726 & 6283.129 & -2368.423 & -32.076 & 97.767 & 331.547 & -371.507 & -167.415 & 188.748 \\
2412.026 & 476.941 & -2382.018 & 2014.753 & 486.682 & -1841.165 & -1386.815 & 128.242 & 2393.05 & -961.438 \\
1688.083 & 877.17 & -114.146 & 433.604 & 445.447 & -1374.196 & -1269.078 & -43.863 & 2404.463 & -943.785 \\
-15867.205 & -7087.878 & -406.067 & -3161.81 & -3587.665 & 16599.626 & 7510.982 & 544.902 & -14185.447 & 4857.133 \\
9913.113 & -1376.838 & 7840.949 & -2871.111 & 1670.017 & -9080.441 & 18045.64 & -1025.567 & 5532.985 & -21923.165 \\
-424.63 & -308.244 & -386.653 & 125.67 & -74.709 & 273.379 & 250.767 & 48.032 & -378.848 & 53.352 \\
-824.766 & 638.712 & -1287.206 & 628.491 & 250.047 & 4601.007 & -3463.662 & 142.212 & 6694.3 & -3266.938 \\
-7934.871 & 1406.406 & -6188.958 & 2288.342 & -1669.7 & 3594.294 & -16306.448 & 813.979 & -10862.434 & 24785.914
\end{pmatrix} \tag{A.7}$$



The weights obtained for $G_{Ep}/G_D$:

$$w_{MP} = (3.930227, 0.1108384, -5.325479, -2.846154, -0.2071328, 0.8742101, 0.4283194, 2.568322, 2.577635, -1.185632) \tag{A.8}$$

with the covariance matrix:

$$\begin{pmatrix}
36866.41 & -62184.005 & 17354.196 & -9375.943 & 693.671 & 7949.39 & -16876.298 & -11986.299 & 10641.393 & 4687.308 \\
-68432.176 & 103227.83 & -25251.786 & 22614.051 & -1329.167 & -14160.394 & 34458.11 & 24954.878 & -19958.73 & -11910.498 \\
14227.233 & -16215.276 & 26518.973 & 2749.674 & 160.818 & 2066.422 & -2921.857 & 2228.451 & 2430.186 & -38.278 \\
-35928.337 & 57408.998 & -6198.14 & 18394.036 & -745.922 & -7749.442 & 20171.669 & 8522.062 & -11194.428 & -7642.097 \\
181.767 & -354.001 & 27.284 & -99.739 & 55.716 & -65.584 & -118 & -98.718 & 696.136 & -337.281 \\
6881.763 & -8970.135 & 2549.152 & -1714.537 & 128.027 & 2720.834 & -3012.463 & -2059.728 & 2495.438 & -329.115 \\
-23847.659 & 36615.2 & -6410.826 & 8820.603 & -475.037 & -5101.518 & 12518.641 & 9338.177 & -7159.334 & -4418.53 \\
22789.127 & -36227.846 & 11996.228 & -14431.928 & 441.824 & 4699.686 & -11630.739 & 8794.696 & 6644.126 & 4132.864 \\
3700.817 & -6332.644 & 791.696 & -1652.194 & 760.047 & -46.633 & -2130.248 & -1683.296 & 9852.346 & -4813.821 \\
17126.803 & -26781.151 & 4320.733 & -6631.983 & -137.16 & 3552.661 & -9216.18 & -6918.134 & -1252.964 & 7966.677
\end{pmatrix} \tag{A.9}$$



The weights obtained for $G_{Mp}/\mu_p G_D$:

$$w_{MP} = (-2.862682, -1.560675, 2.321148, 0.1283189, -0.2803566, 2.794296, 1.726774, 0.861083, 0.4184286, -0.1526676) \tag{A.10}$$

with the covariance matrix:

$$\begin{pmatrix}
15709.171 & 6861.227 & 2766.185 & -6126.712 & -121.496 & 1318.866 & 8737.945 & 4008.92 & -94.694 & -3978.666 \\
3284.282 & 2803.079 & 843.705 & -1333.839 & -40.306 & 438.59 & 817.679 & 1156.474 & -31.955 & -1145.984 \\
1993.96 & 1142.807 & 495.778 & -954.548 & -31.697 & 341.671 & 836.303 & 462.66 & -25.013 & -454.17 \\
-3841.3 & -1694.722 & -859.519 & 2450.449 & 47.229 & -515.322 & -1122.054 & -167.405 & 37.338 & 155.266 \\
-88.252 & -54.872 & -31.981 & 60.267 & 8.866 & -76.635 & -33.677 & -16.192 & 6.739 & 12.676 \\
935.496 & 585.08 & 339.546 & -538.917 & -76.046 & 720.701 & 349.467 & 169.075 & -53.896 & -146.686 \\
14256.242 & 5279.888 & 2236.735 & -4498.246 & -93.725 & 1012.932 & 9909.287 & 4366.293 & -72.416 & -4343.104 \\
6227.654 & 2640.608 & 867.323 & -1228.301 & -32.179 & 348.663 & 3680.716 & 2148.384 & -24.917 & -2140.484 \\
-67.875 & -43.08 & -25.232 & 39.496 & 6.797 & -64.859 & -24.96 & -12.176 & 6.271 & 8.358 \\
-6204.745 & -2625.836 & -868.408 & 1214.946 & 28.68 & -324.213 & -3672.214 & -2144.257 & 21.06 & 2139.095
\end{pmatrix} \tag{A.11}$$
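The parametrizations of this appendix can be evaluated directly from the listed weights. The Python sketch below assumes the weight ordering of a 1-M-1 network in which the pairs $(w_1, w_2), (w_3, w_4), \ldots$ are the input weight and bias of each hidden unit, followed by the $M$ output weights and the output bias; the function and variable names are illustrative. Under this assumed ordering all four parametrizations reproduce the constraint imposed at $Q^2 = 0$, which serves as a consistency check.

```python
import math


def f_act(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def network_output(q2, w):
    """Output of a 1-M-1 network with 3M + 1 weights (assumed ordering)."""
    m = (len(w) - 1) // 3
    hidden = [f_act(q2 * w[2 * i] + w[2 * i + 1]) for i in range(m)]
    return sum(w[2 * m + i] * h for i, h in enumerate(hidden)) + w[3 * m]


# Weights copied from the vectors listed for each data set.
W_GEN = (10.19704, 2.36812, -1.144266, -4.274101,
         0.8149924, 2.985524, -0.7864434)
W_GMN = (3.19646, 2.565681, 6.441526, -2.004055, -0.2972361,
         3.606737, -3.135199, 0.299523, 1.261638, 2.64747)
W_GEP = (3.930227, 0.1108384, -5.325479, -2.846154, -0.2071328,
         0.8742101, 0.4283194, 2.568322, 2.577635, -1.185632)
W_GMP = (-2.862682, -1.560675, 2.321148, 0.1283189, -0.2803566,
         2.794296, 1.726774, 0.861083, 0.4184286, -0.1526676)

# Values at Q^2 = 0: close to 0 for G_En and close to 1 for the ratios.
gen0 = network_output(0.0, W_GEN)
gmn0 = network_output(0.0, W_GMN)
gep0 = network_output(0.0, W_GEP)
gmp0 = network_output(0.0, W_GMP)
```

The 1-3-1 outputs are the normalized ratios, so the form-factors themselves are obtained by multiplying by $g\,G_D(Q^2)$.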
Figure 1: The feed-forward neural network (of type 1-4-1) with one hidden layer, one input and one output unit and 4 hidden units, representing the form-factor $G(Q^2)$.

Figure 2: Single neuron.
Figure 3: Fits of the $G_{Mn}/\mu_n G_D$ data parametrized with a network of large size. The results were obtained with: a fixed, underestimated value of $\alpha$ (red line); a fixed, overestimated value of $\alpha$ (violet line); an online optimized value of $\alpha$ (green line).

Figure 4: The $G_{Mn}/\mu_n G_D$ uncertainties (of the fits presented in Fig. 3) computed with (3.16).

Figure 5: $S(w_{MP}, \mathcal{D})/N$ distribution obtained for the network sample trained with the $G_{Ep}/G_D$ data. The 1-3-1 network type was applied.

Figure 6: $S(w_{MP}, \mathcal{D})/N$ distribution obtained for the network sample trained with the $G_{Mp}/\mu_p G_D$ data. The 1-3-1 network type was applied.
Figure 7: Left panels: $S(w, \mathcal{D})$ dependence on the iteration step. Right panels: the best fits obtained for the $G_{Mn}/\mu_n G_D$ and $G_{En}$ data. The results obtained with the (3.24) prior are denoted by green lines, while the results computed for the (3.9) prior function are plotted with blue lines. For the magnetic neutron data the network of 1-3-1 type was trained. The electric neutron data was analyzed with the 1-2-1 network type.

Figure 8: Dependence of the iterated $\alpha$ parameter on the initial $\alpha_0$ value. The results were obtained for the 1-2-1 network type trained with the $G_{En}$ data.

Figure 9: Learning schema.
Figure 10: The total error, $S(w_{MP})$, as a function of $\ln \mathcal{P}(\mathcal{D}|M)$ (ln evidence). The evidence is computed for networks trained with the $G_{Mn}/\mu_n G_D$ data. The results obtained for networks with $M = 1, \ldots, 5$ hidden units are shown. A single point represents the fit obtained for a given starting weight configuration and a particular network type.

Figure 11: The dependence of $\ln \mathcal{P}(\mathcal{D}|M)$ on the number of hidden units. The evidence is computed for networks trained with the $G_{Mn}/\mu_n G_D$ data. The maximal and minimal values of $\ln \mathcal{P}(\mathcal{D}|M)$ (for a given network type) are plotted with the red and green lines, respectively. The mean of $\ln \mathcal{P}(\mathcal{D}|M)$ over all acceptable solutions is represented by the blue line.
Figure 12: Fits of the $G_{Mn}/\mu_n G_D$ data parametrized with networks of 1-1-1 (green line), 1-2-1 (violet line), 1-3-1 (blue line), 1-4-1 (cyan line) and 1-5-1 (magenta line) types. The best fit (shown with $1\sigma$ uncertainty), which was indicated by the maximal evidence, is given by the 1-3-1 network. The blue area denotes the fit uncertainty computed with (3.16). The experimental data is the same as the one discussed in Ref. [30].

Figure 13: The fit uncertainty computed (with Eq. 3.16) for the parametrizations shown in Fig. 12.
Figure 14: The total error, $S(w_{MP})$, as a function of $\ln \mathcal{P}(\mathcal{D}|M)$ (ln evidence). The evidence is computed for networks trained with the $G_{En}$ data. The results obtained for networks with $M = 1, \ldots, 5$ hidden units are shown. A single point represents the fit obtained for a given starting weight configuration and a particular network type.

Figure 15: The dependence of $\ln \mathcal{P}(\mathcal{D}|M)$ on the number of hidden units. The evidence is computed for networks trained with the $G_{En}$ data. The maximal and minimal values of $\ln \mathcal{P}(\mathcal{D}|M)$ (for a given network type) are plotted with the red and green lines, respectively. The mean of $\ln \mathcal{P}(\mathcal{D}|M)$ over all acceptable solutions is represented by the blue line.

Figure 16: The best fit of the $G_{En}$ data given by the 1-2-1 network. The blue area denotes the fit uncertainty computed with Eq. 3.16. The experimental data is the same as the one discussed in Ref. [30].
Figure 17: The total error, $S(w_{MP})$, as a function of $\ln \mathcal{P}(\mathcal{D}|M)$ (ln evidence). The evidence is computed for networks trained with the $G_{Ep}/G_D$ data. The results obtained for networks with $M = 1, \ldots, 5$ hidden units are shown. A single point represents the fit obtained for a given starting weight configuration and a particular network type.

Figure 18: The dependence of $\ln \mathcal{P}(\mathcal{D}|M)$ on the number of hidden units. The evidence is computed for networks trained with the $G_{Ep}/G_D$ data. The maximal and minimal values of $\ln \mathcal{P}(\mathcal{D}|M)$ (for a given network type) are plotted with the red and green lines, respectively. The mean of $\ln \mathcal{P}(\mathcal{D}|M)$ over all acceptable solutions is represented by the blue line.
Figure 19: The best fit of the $G_{Ep}/G_D$ data. The fit to the TPE-corrected data is given by the 1-3-1 network (blue line); the data (red points) is taken from [62]. The fit to the "old Rosenbluth data" (green points) is given by the 1-1-1 network (violet line); the data is taken from [?]. The fit uncertainty is computed with Eq. 3.16.

Figure 20: The best fit of the $G_{Mp}/\mu_p G_D$ data given by the 1-3-1 network. The fit to the TPE-corrected data is given by the 1-3-1 network (violet line); the data (red points) is taken from [62]. The fit to the "old Rosenbluth data" (green points) is given by the 1-1-1 network (violet line); the data is taken from [?]. The fit uncertainty is computed with Eq. 3.16.
Figure 21: The total error, $S(w_{MP})$, as a function of $\ln \mathcal{P}(\mathcal{D}|M)$ (ln evidence). The evidence is computed for networks trained with the $G_{Mp}/\mu_p G_D$ data. The results obtained for networks with $M = 1, \ldots, 6$ hidden units are shown. A single point represents the fit obtained for a given starting weight configuration and a particular network type.

Figure 22: The dependence of $\ln \mathcal{P}(\mathcal{D}|M)$ on the number of hidden units. The evidence is computed for networks trained with the $G_{Mp}/\mu_p G_D$ data. The maximal and minimal values of $\ln \mathcal{P}(\mathcal{D}|M)$ (for a given network type) are plotted with the red and green lines, respectively. The mean of $\ln \mathcal{P}(\mathcal{D}|M)$ over all acceptable solutions is represented by the blue line.
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